# Libraries used throughout: tidyverse (dplyr, ggplot2, tidyr) and GGally for ggcorr()
library(tidyverse)
library(GGally)

student <- read.table("student-mat.csv", sep=";", header=T)
student <- as_tibble(student)

# Keep an unfiltered copy for the sensitivity analysis at the end
student_all <- read.table("student-mat.csv", sep=";", header=T)
student_all <- as_tibble(student_all)

Introduction

In 2006, Portugal had an early school leaving rate of nearly 40% among 18-24 year olds, while the EU average was 15% (Eurostat 2007). Various analyses and interventions have therefore targeted school success in Portuguese adolescents and attempted to improve education overall. Mathematics performance, in particular, has been a weak point for Portuguese students.

We therefore analyze student data from two Portuguese secondary schools. The dataset records the mathematics course grade as the outcome, together with 30 predictors that range from demographic factors and family situation to student habits and extracurricular involvement. Data were collected from 395 students spanning ages 15-22.

Portuguese schools use a five-level grade classification. With the final score measured from 0-20, the categories are: Excellent (16-20), Good (14-15), Satisfactory (12-13), Sufficient (10-11), and Fail (0-9). For the purposes of this analysis, however, we analyze the exact score rather than the score category.
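As a concrete illustration, the score-to-category mapping can be written with cut(); grade_category below is our own hypothetical helper, not a variable in the dataset:

# Map a 0-20 score to the five-level classification using right-closed intervals:
# (-1,9] Fail, (9,11] Sufficient, (11,13] Satisfactory, (13,15] Good, (15,20] Excellent
grade_category <- function(score) {
  cut(score,
      breaks = c(-1, 9, 11, 13, 15, 20),
      labels = c("Fail", "Sufficient", "Satisfactory", "Good", "Excellent"))
}
grade_category(c(4, 10, 12, 15, 18))  # Fail, Sufficient, Satisfactory, Good, Excellent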

In this analysis, we wish to predict student success in secondary school math as a result of two types of factors:

  1. Unchangeable (family situation, parents’ jobs, home location, etc.)
  2. Targetable behavior (alcohol consumption, absences, extracurricular involvement, etc.)

The potential outcomes of this analysis are:

  1. Identification of student behaviors that can be targeted through school-related programs or interventions in order to increase success
  2. Identification of inherent factors in certain students’ backgrounds that may put them at a disadvantage from the start. Early identification without discrimination could allow these students to be helped more effectively along the way.

Data Description

The initial covariates in the dataset are:

  1. SCHOOL - student’s school (binary: “GP” - Gabriel Pereira or “MS” - Mousinho da Silveira)
  2. SEX - student’s sex (binary: “F” - female or “M” - male)
  3. AGE - student’s age (numeric: from 15 to 22)
  4. ADDRESS - student’s home address type (binary: “U” - urban or “R” - rural)
  5. FAMSIZE - family size (binary: “LE3” - less or equal to 3 or “GT3” - greater than 3)
  6. PSTATUS - parent’s cohabitation status (binary: “T” - living together or “A” - apart)
  7. MEDU - mother’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
  8. FEDU - father’s education (numeric: 0 - none, 1 - primary education (4th grade), 2 – 5th to 9th grade, 3 – secondary education or 4 – higher education)
  9. MJOB - mother’s job (nominal: “teacher”, “health” care related, civil “services” (e.g. administrative or police), “at_home” or “other”)
  10. FJOB - father’s job (nominal: “teacher”, “health” care related, civil “services” (e.g. administrative or police), “at_home” or “other”)
  11. REASON - reason to choose this school (nominal: close to “home”, school “reputation”, “course” preference or “other”)
  12. GUARDIAN - student’s guardian (nominal: “mother”, “father” or “other”)
  13. TRAVELTIME - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
  14. STUDYTIME - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
  15. FAILURES - number of past class failures (numeric: n if 1<=n<3, else 4)
  16. SCHOOLSUP - extra educational support (binary: yes or no)
  17. FAMSUP - family educational support (binary: yes or no)
  18. PAID - extra paid classes within the course subject (Math or Portuguese) (binary: yes or no)
  19. ACTIVITIES - extra-curricular activities (binary: yes or no)
  20. NURSERY - attended nursery school (binary: yes or no)
  21. HIGHER - wants to take higher education (binary: yes or no)
  22. INTERNET - Internet access at home (binary: yes or no)
  23. ROMANTIC - with a romantic relationship (binary: yes or no)
  24. FAMREL - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
  25. FREETIME - free time after school (numeric: from 1 - very low to 5 - very high)
  26. GOOUT - going out with friends (numeric: from 1 - very low to 5 - very high)
  27. DALC - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
  28. WALC - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
  29. HEALTH - current health status (numeric: from 1 - very bad to 5 - very good)
  30. ABSENCES - number of school absences (numeric: from 0 to 93)

And the response variables are:

  1. G1 - first period grade (numeric: from 0 to 20)
  2. G2 - second period grade (numeric: from 0 to 20)
  3. G3 - final grade (numeric: from 0 to 20, output target)

For this analysis, we will be considering the G3 final grade as our response.
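Before exploring the data, it is worth confirming how read.table imported each column; a quick sketch using glimpse() from dplyr (many of the ordinal covariates, such as Medu and traveltime, arrive as plain integers):

glimpse(student)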

Exploratory Data Analysis and Variable Filtering

Distribution of the Response Variable

student %>% 
  ggplot(aes(G3)) + geom_histogram(binwidth = 1, color='black', fill = 'grey') + 
  geom_vline(xintercept = mean(student$G3), lwd = 1, color = 'red') +
  labs(title = "Distribution of Final Math Grade", x="Final Grade", y="Count")

student %>% select(G1, G2, G3) %>% filter(G3 ==0 & G1 ==0 & G2 ==0)
## # A tibble: 0 x 3
## # … with 3 variables: G1 <int>, G2 <int>, G3 <int>

Let’s remove the students with a final grade of 0, as a true score of 0 makes little practical sense, especially since all of these students earned grades above 0 in the first and/or second period.

An alternative, more adventurous approach would be to treat those 0’s as missing and impute them, but let’s not overcomplicate things yet. For reference, MICE code if we decide to impute:

# student <- read.table("student-mat.csv", sep=";", header=T)
# student <- as.data.frame(student)
# student$G3[student$G3 == 0] <- NA
# md.pattern(as.data.frame(student))  # 38 students are missing G3, almost 10% of the data
# imputed_data <- mice(student, m = 5, maxit = 50, method = "pmm", seed = 500)
# summary(imputed_data)
# imputed_data$imp$G3
# imputed <- complete(imputed_data)

Correlation for all continuous variables

# Note: this throws a warning about non-numeric variables but still produces output
# for all numeric ones. It also gives an additional reason to model G3 alone, as
# G1/G2 are highly correlated with G3
ggcorr(student)

student <- student %>% filter(G3 != 0) %>% select(-G1, -G2)
student_all <- student_all %>% select(-G1, -G2)
student %>% 
  ggplot(aes(G3)) + geom_histogram(binwidth = 1, color='black', fill = 'grey') + 
  geom_vline(xintercept = mean(student$G3), lwd = 1, color = 'red') +
  labs(title = "Distribution of Final Math Grade", x="Final Grade", y="Count")

Any difference in correlation after we drop G3=0 records?

# No real differences in the correlations are apparent after dropping the G3 = 0 records
ggcorr(student)

Covariates and their relation to Final Grade

student %>% 
  gather(key = "Variable", value = "Value", -G3) %>% 
  ggplot(aes(Value, G3)) + facet_wrap(~ Variable, scales = "free") + 
  geom_boxplot()

## visualizing absences
student %>%
  # filter(absences < 23) %>%
  mutate(absences = factor(absences)) %>%
  ggplot(aes(absences, G3)) +
  geom_boxplot()

ggplot(student, aes(absences, G3)) +
  geom_jitter() +
  geom_smooth() +
  labs(title = "Final Grade by School Absences", x = "Number of school absences", y = "Final grade")

Since the variable “absences” has a greater range of values than most of the other variables, and since the data become extremely sparse beyond 22 absences, we pooled all values from 23-93 into the single value “23.”

## Pooling all absences greater than 22 into the 23 group
student <- student %>% 
  mutate(absences = case_when(
                      absences > 22 ~ 23L,
                      TRUE ~ absences))

## visualizing new absences variable
student %>%
  # filter(absences < 23) %>%
  mutate(absences = factor(absences)) %>%
  ggplot(aes(absences, G3)) +
  geom_boxplot()

ggplot(student, aes(absences, G3)) +
  geom_jitter() +
  geom_smooth()

“Absences” still has a much greater spread of values than the other variables, so we will try binning it to see how that affects our models.

## Binning absences into 5 groups with 5 absence numbers each (except for the last group that includes all absences greater than or equal to 20).

# Binning
student <- student %>% 
  mutate(absences = case_when(
                      absences < 5 ~ 1L, 
                      absences >= 5 & absences < 10 ~ 2L,
                      absences >= 10 & absences < 15 ~ 3L,
                      absences >= 15 & absences < 20 ~ 4L,
                      absences >= 20 ~ 5L,
                      TRUE ~ absences))

student_all <- student_all %>% 
  mutate(absences = case_when(
                      absences < 5 ~ 1L, 
                      absences >= 5 & absences < 10 ~ 2L,
                      absences >= 10 & absences < 15 ~ 3L,
                      absences >= 15 & absences < 20 ~ 4L,
                      absences >= 20 ~ 5L,
                      TRUE ~ absences))

## visualizing new absences variable
student %>%
  # filter(absences < 23) %>%
  mutate(absences = factor(absences)) %>%
  ggplot(aes(absences, G3)) +
  geom_boxplot()

ggplot(student, aes(absences, G3)) +
  geom_jitter() +
  geom_smooth()

student %>% 
  mutate(absences = factor(absences)) %>% 
  ggplot(aes(absences, G3, fill = absences)) +
  geom_violin(color = "gray") +
  geom_jitter(width = 0.2, alpha = 0.5) +
  labs(title = "Final Grade by School Absences", x = "School Absences", y = "Final grade") +
  scale_fill_brewer(palette="Pastel2") +
  theme(legend.position="none") +
  scale_x_discrete(labels=c("<5", "5-9", "10-14", "15-19", ">20"))

## visualizing alcohol consumption

# weekend alcohol consumption (5 is highest)
student %>% 
  mutate(Walc = factor(Walc)) %>% 
  ggplot(aes(Walc, G3)) +
  geom_violin() +
  geom_jitter(width = 0.2, alpha = 0.5)

# workday alcohol consumption (5 is highest)
student %>% 
  mutate(Dalc = factor(Dalc)) %>% 
  ggplot(aes(Dalc, G3, fill = Dalc)) +
  geom_violin(color = "gray") +
  geom_jitter(width = 0.2, alpha = 0.5) +
  labs(title = "Final Grade by Workday Alcohol Consumption", x = "Level of workday alcohol consumption (5 is very high)", y = "Final grade") +
  scale_fill_brewer(palette="Pastel2") +
  theme(legend.position="none")

## Time spent Studying
student %>% 
  mutate(studytime = factor(studytime)) %>% 
  ggplot(aes(studytime, G3, fill = studytime)) +
  geom_violin(color = "gray") +
  geom_jitter(width = 0.2, alpha = 0.5) +
  labs(title = "Final Grade by Time Spent Studying", x = "Weekly Study Time", y = "Final grade") +
  scale_fill_brewer(palette="Pastel2") +
  theme(legend.position="none") +
  scale_x_discrete(labels=c("<2 hours", "2-5 hours", "5-10 hours", ">10 hours"))

## Time spent travelling
student %>% 
  mutate(traveltime = factor(traveltime)) %>% 
  ggplot(aes(traveltime, G3)) +
  geom_violin() +
  geom_jitter(width = 0.2, alpha = 0.5)

## Guardian
student %>% 
  mutate(guardian = factor(guardian)) %>% 
  ggplot(aes(guardian, G3)) +
  geom_violin() +
  geom_jitter(width = 0.2, alpha = 0.5)

Based on these plots, we decided to remove the variable “guardian” from our dataset since it didn’t seem to have any influence on the outcome.

We also observed that students in this dataset were from two different schools. Using school in our prediction of math course success, although it may increase prediction accuracy, would make our analysis less generalizable to the overall population of students. Since the overarching goal of this analysis is to identify factors that systematically affect secondary school student success in math courses, we decided to remove school as a predictor variable.
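Before dropping it, a quick sketch comparing the two schools in size and mean final grade:

# School sizes and mean final grade, prior to removing the school variable
student %>% group_by(school) %>% summarize(n = n(), mean_G3 = mean(G3))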

# Rescale age so that 0 corresponds to the youngest students (age 15)
student$age <- student$age - min(student$age)

student <- student %>% 
  select(-guardian) %>%
  select(-school)

student_all <- student_all %>% 
  select(-guardian) %>%
  select(-school)

Initial BIC/R^2

Just to get an initial sense of influential variables, we can look at all variables retained to this point and examine fit statistics. The code below performs best-subset selection: it searches all possible combinations of covariates up to size nvmax and reports the best model of each size by fit criteria. For example, with nvmax = 20 it produces the best model with 1 covariate, with 2 covariates, and so on up to 20.

#Based on results the best models fall within 5-20 vars, so I only considered 20 variables in this run
best_subset <- leaps::regsubsets(G3 ~ ., student, nvmax = 20, method='exhaustive')
results<-summary(best_subset)

tibble(predictors = 1:20,
       adj_R2 = results$adjr2,
       Cp = results$cp,
       BIC = results$bic) %>%
  gather(statistic, value, -predictors) %>%
  ggplot(aes(predictors, value, color = statistic)) +
  geom_line(show.legend = F) +
  geom_point(show.legend = F) +
  facet_wrap(~ statistic, scales = "free")

#which.max(results$adjr2)
# 17 variables
#which.min(results$bic)
# 8 variables
#which.min(results$cp)
# 14 variables

Coefficients from the best subsets above, based on BIC

Notice that while the “best” option by BIC has 8 covariates, we give up very little BIC by using 7. We present both sets of coefficients here; when all other criteria are equal, we generally prefer the more parsimonious model.

coef(best_subset, 8)
##  (Intercept)   Mjobhealth Mjobservices  Fjobteacher     failures 
##   14.6123286    1.7919419    1.4684797    2.0295712   -1.1000624 
## schoolsupyes        goout       health     absences 
##   -2.3415181   -0.4729677   -0.2581331   -0.4284993
coef(best_subset, 7)
##  (Intercept)   Mjobhealth Mjobservices  Fjobteacher     failures 
##   13.6819891    1.6852313    1.3837170    2.0288848   -1.1282540 
## schoolsupyes        goout     absences 
##   -2.3052617   -0.4655962   -0.4141869
# We lose health when we drop to 7 variables. Otherwise all variables remain the same and are very similar in their effect size.

Collinearity Analysis

VIF and Tolerance

# Build the design matrix, then drop the intercept column and reassemble a
# data frame with G3 and the dummy-coded covariates
covariates_int <- model.matrix(G3 ~ ., data = student)
covariates <- covariates_int %>% as.data.frame() %>% select(-`(Intercept)`)
student_d <- data.frame(G3 = student$G3, covariates)
model1 <- lm(G3 ~ ., data = student_d)
# VIF_j measures how much collinearity inflates the variance of coefficient j;
# tolerance is its reciprocal
VIF = car::vif(model1)
Tol = 1/VIF
df = data.frame(Variable = names(VIF), VIF = VIF, Tolerance = Tol)
df %>% arrange(desc(VIF)) %>% head(10) %>% knitr::kable(align = c("c", "c"))
Variable        VIF        Tolerance
Fjobother       6.342636   0.1576631
Fjobservices    5.576032   0.1793390
Mjobteacher     3.294431   0.3035425
Mjobservices    3.086430   0.3239989
Mjobother       2.919526   0.3425214
Medu            2.799083   0.3572598
Fjobteacher     2.766049   0.3615265
Walc            2.443405   0.4092649
Mjobhealth      2.423100   0.4126945
Fjobhealth      2.275655   0.4394340
#set up dataframe for all students dataset:
covariates_int_all <- model.matrix(G3 ~ ., data = student_all )
covariates_all <- covariates_int_all %>% as.data.frame() %>% select(-`(Intercept)`)
student_d_all <- data.frame(G3 = student_all$G3, covariates_all)

Displayed above are the covariates with the highest Variance Inflation Factors.
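As a sanity check (a sketch using the student_d frame built above, with Fedu chosen arbitrarily), a covariate’s VIF equals \(1/(1 - R_j^2)\), where \(R_j^2\) comes from regressing that covariate on all the others:

# VIF_j = 1 / (1 - R^2_j): regress one covariate (here Fedu) on all the other covariates
r2_fedu <- summary(lm(Fedu ~ . - G3, data = student_d))$r.squared
1 / (1 - r2_fedu)  # should match car::vif(model1)["Fedu"]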

Eigenanalysis of the Correlation and Scaled SSCP Matrices

# Scaled SSCP matrix: scale X'X (intercept column included) to unit diagonal
xtx = t(covariates_int) %*% covariates_int
Ds_half <- diag(diag(xtx)^-0.5)
sscp <- Ds_half %*% xtx %*% Ds_half
eig_sscp <- eigen(sscp)$values
PCs_sscp <- prcomp(sscp)[2]
# Condition index: sqrt of the largest eigenvalue over each eigenvalue
CI_sscp <- sqrt(eig_sscp[1]/eig_sscp)

# Correlation matrix of the centered covariates (intercept excluded)
cov_center <- apply(covariates, 2, function(y) y - mean(y))
C <- (t(cov_center) %*% cov_center)/dim(cov_center)[1]
Dc_half <- diag(diag(C)^-0.5)
R <- Dc_half %*% C %*% Dc_half
eig_corr <- eigen(R)$values
CI_corr <- sqrt(eig_corr[1]/eig_corr)
PCs_corr <- prcomp(R)[2]

df <- data.frame("Eigenvalue" = c(eig_corr), "Condition Index" = c(CI_corr))
df2 <- data.frame("Eigenvalue" = c(eig_sscp), "Condition Index" = c(CI_sscp))

df %>% tail(2) %>% knitr::kable(align = c("c", "c"))
     Eigenvalue   Condition Index
35   0.1042497    5.546255
36   0.0685997    6.837166
df2 %>% filter(`Condition.Index` > 30) %>% knitr::kable(align = c("c", "c"))
Eigenvalue   Condition Index
0.0063958    57.15686

From the correlation matrix, even the largest condition index is very small, so no collinearity among the centered covariates needs to be addressed. From the scaled SSCP matrix, however, one condition index is well above 30, the conventional cutoff for serious collinearity, so we examine the corresponding principal components.

PCs_sscp$rotation[1:17,36:37]
##              PC36         PC37
##  [1,] -0.54238432 -0.768156681
##  [2,]  0.01500628  0.001119414
##  [3,] -0.01588949  0.052591238
##  [4,]  0.04850298  0.010901229
##  [5,] -0.01501923  0.024552007
##  [6,]  0.06871137  0.031398323
##  [7,]  0.36810775 -0.223722976
##  [8,] -0.02014247  0.047035204
##  [9,] -0.15226341  0.102934232
## [10,] -0.18494245  0.141719926
## [11,] -0.19775873  0.138712109
## [12,] -0.20887103  0.139184678
## [13,] -0.11254029  0.125983102
## [14,] -0.34217703  0.393506844
## [15,] -0.22943971  0.277311183
## [16,] -0.14535872  0.157940062
## [17,] -0.02812198  0.031426805

Above are the loadings of the last two PCs of the scaled SSCP matrix for the first 17 covariate columns (the remaining loadings are close to 0). The intercept (row 1) loads heavily on both, along with Medu (row 7) and the parents’ job dummy variables (rows 10-16), suggesting these covariates may be collinear with the intercept. They are listed below.

colnames(student_d)[c(7, 10:16)]
## [1] "Medu"         "Mjobother"    "Mjobservices" "Mjobteacher" 
## [5] "Fjobhealth"   "Fjobother"    "Fjobservices" "Fjobteacher"

Many of these are dummy variables from factors whose reference levels are absorbed into the intercept, so it makes sense that they might be collinear with it. We will not remove any of them.


Models and Model Testing

There are many types of models to consider when building a regression model, such as multiple linear regression, ANOVA, linear mixed models, and logistic/Poisson regression. Classically, linear mixed models are used when observations are related, logistic regression is used for binary outcomes, and Poisson regression for counts. Since we assume our observations (the students) are independent for our purposes, and the outcome variable (final grade) is numeric, we narrow our model search down to multiple linear regression and ANOVA. Since, at least to start, we have many factor-type covariates, we begin by exploring multiple linear regression models and their fit to our data.

Model 1: Full Model

Note: because each categorical variable is expanded into several dummy variables, there are more than 30 coefficients; the full model below has 37 (an intercept plus 36 slopes).

The first model we consider is a full model of all covariates possible and no interaction terms included. The model can be written as follows:

\[y = \beta_0 + \beta_1X_1 + ... + \beta_{36}X_{36} + \epsilon\] where \(\beta_i\) represents the effect of covariate \(X_i\) on the final grade. There are more covariates than expected because each factor variable is recreated as a series of “dummy” variables: a factor with \(n\) levels contributes \(n-1\) dummy variables, and therefore \(n-1\) parameters.
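To see the dummy expansion concretely, model.matrix() shows how a single five-level factor such as Mjob becomes four indicator columns (the reference level is absorbed into the intercept):

# A 5-level factor expands into 4 dummy columns plus the intercept
head(model.matrix(~ Mjob, data = student), 3)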

summary(model1)
## 
## Call:
## lm(formula = G3 ~ ., data = student_d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5064 -1.8720  0.0293  1.8227  6.9600 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      12.61184    1.74233   7.238 3.39e-12 ***
## sexM              0.57764    0.35502   1.627  0.10471    
## age              -0.18458    0.14030  -1.316  0.18923    
## addressU          0.48945    0.40361   1.213  0.22615    
## famsizeLE3        0.24556    0.34389   0.714  0.47571    
## PstatusT         -0.24239    0.50396  -0.481  0.63087    
## Medu              0.17054    0.22593   0.755  0.45090    
## Fedu              0.13734    0.19477   0.705  0.48125    
## Mjobhealth        1.14310    0.80393   1.422  0.15603    
## Mjobother        -0.58096    0.52655  -1.103  0.27071    
## Mjobservices      0.84323    0.58848   1.433  0.15287    
## Mjobteacher      -0.71277    0.74734  -0.954  0.34094    
## Fjobhealth       -0.63368    1.01710  -0.623  0.53371    
## Fjobother        -0.63043    0.74669  -0.844  0.39913    
## Fjobservices     -0.79711    0.77579  -1.027  0.30497    
## Fjobteacher       1.18464    0.94423   1.255  0.21053    
## reasonhome        0.26472    0.39912   0.663  0.50765    
## reasonother      -0.09380    0.57353  -0.164  0.87019    
## reasonreputation  0.13815    0.41101   0.336  0.73699    
## traveltime        0.02845    0.23921   0.119  0.90542    
## studytime         0.51548    0.20779   2.481  0.01362 *  
## failures         -0.82301    0.25773  -3.193  0.00155 ** 
## schoolsupyes     -2.42939    0.46632  -5.210 3.40e-07 ***
## famsupyes        -0.63394    0.33989  -1.865  0.06308 .  
## paidyes          -0.44278    0.33601  -1.318  0.18852    
## activitiesyes     0.12753    0.32141   0.397  0.69179    
## nurseryyes       -0.27189    0.39819  -0.683  0.49522    
## higheryes         0.39340    0.85823   0.458  0.64699    
## internetyes       0.61575    0.44649   1.379  0.16883    
## romanticyes      -0.09634    0.34193  -0.282  0.77832    
## famrel            0.12137    0.17748   0.684  0.49457    
## freetime          0.10646    0.16699   0.638  0.52424    
## goout            -0.45766    0.16638  -2.751  0.00628 ** 
## Dalc              0.05727    0.23179   0.247  0.80502    
## Walc             -0.15349    0.17833  -0.861  0.39005    
## health           -0.23289    0.11382  -2.046  0.04156 *  
## absences         -0.36947    0.14796  -2.497  0.01303 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.787 on 320 degrees of freedom
## Multiple R-squared:  0.3296, Adjusted R-squared:  0.2542 
## F-statistic: 4.371 on 36 and 320 DF,  p-value: 2.751e-13

The adjusted \(R^2\) of 0.254 is disappointingly low.

Diagnostics

In order to determine the validity of this model, we must test the assumptions of linear regression. These assumptions include homogeneity of variance and Gaussian errors.

stud_resid = rstudent(model1)
dat_resid <- data.frame(RESID = stud_resid, ABS = abs(stud_resid), PRED = model1$fitted.values)
dat_resid %>% arrange(-ABS) %>% filter(ABS > 2)
##        RESID      ABS      PRED
## 1   2.698348 2.698348 12.039974
## 2   2.544754 2.544754 10.371589
## 3   2.516308 2.516308 11.339381
## 4  -2.454889 2.454889 11.506401
## 5   2.421202 2.421202 11.522224
## 6   2.419423 2.419423 11.716532
## 7   2.398537 2.398537 12.654729
## 8  -2.324983 2.324983 12.230358
## 9   2.276336 2.276336 12.075219
## 10  2.260274 2.260274 13.063600
## 11 -2.241573 2.241573 11.990685
## 12  2.203381 2.203381  8.205372
## 13 -2.125594 2.125594 11.673253
## 14  2.118535 2.118535 12.430768
## 15  2.057514 2.057514 12.684111
## 16 -2.048861 2.048861 11.302894
## 17 -2.026352 2.026352 12.250541
## 18 -2.022032 2.022032 11.184054

Each of the studentized residuals listed above exceeds 2 in absolute value, so these observations may merit closer examination; we return to them in the outlier analysis below.

Gaussian Errors Assumption

nortest::ad.test(dat_resid$RESID)
## 
##  Anderson-Darling normality test
## 
## data:  dat_resid$RESID
## A = 0.29059, p-value = 0.6087

From the output above, since the p-value of 0.6087 is greater than \(\alpha = 0.05\), we fail to reject the null hypothesis that the residuals are normally distributed.
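As a cross-check (a sketch; the analysis above relies on the Anderson-Darling test alone), the Shapiro-Wilk test can be applied to the same studentized residuals:

# Shapiro-Wilk as a second normality test on the studentized residuals
shapiro.test(dat_resid$RESID)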

Homogeneity of Variances (and Gaussian Errors) Assumptions

p1 <- ggplot(data = dat_resid) + geom_histogram(aes(RESID), bins = 25) + labs(x = "Studentized Residuals")
p2 <- ggplot(data = dat_resid) + geom_point(aes(PRED, RESID)) + geom_hline(aes(yintercept = 0)) + labs(x = "Predicted Values", y = "Studentized Residuals")
cowplot::plot_grid(p1, p2)

Model 2: No Intercept

Since the collinearity analysis suggested that several variables were collinear with the intercept, we next fit a model identical to the full model but with the intercept removed.

model2 <- lm(G3 ~ . -1, data = student_d)
summary(model2)
## 
## Call:
## lm(formula = G3 ~ . - 1, data = student_d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5506 -1.8695  0.1517  1.8989  8.0781 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## sexM              0.754517   0.381480   1.978  0.04880 *  
## age               0.014509   0.148178   0.098  0.92206    
## addressU          1.105820   0.424932   2.602  0.00969 ** 
## famsizeLE3        0.496159   0.368521   1.346  0.17914    
## PstatusT          0.768732   0.521541   1.474  0.14147    
## Medu              0.317803   0.242360   1.311  0.19070    
## Fedu              0.332139   0.207772   1.599  0.11090    
## Mjobhealth        0.780292   0.864211   0.903  0.36726    
## Mjobother        -0.506167   0.567026  -0.893  0.37270    
## Mjobservices      0.668814   0.633313   1.056  0.29174    
## Mjobteacher      -1.176189   0.801990  -1.467  0.14347    
## Fjobhealth        1.128215   1.063670   1.061  0.28963    
## Fjobother         1.261640   0.753360   1.675  0.09497 .  
## Fjobservices      1.188711   0.781597   1.521  0.12928    
## Fjobteacher       2.942862   0.982784   2.994  0.00296 ** 
## reasonhome        0.452118   0.428985   1.054  0.29271    
## reasonother       0.561494   0.610001   0.920  0.35801    
## reasonreputation  0.277318   0.442207   0.627  0.53102    
## traveltime        0.506018   0.247660   2.043  0.04185 *  
## studytime         0.895230   0.216552   4.134 4.55e-05 ***
## failures         -0.619055   0.275930  -2.244  0.02554 *  
## schoolsupyes     -2.093990   0.499782  -4.190 3.61e-05 ***
## famsupyes        -0.610801   0.366076  -1.669  0.09619 .  
## paidyes          -0.542762   0.361601  -1.501  0.13434    
## activitiesyes     0.050217   0.345997   0.145  0.88469    
## nurseryyes        0.120396   0.424898   0.283  0.77709    
## higheryes         3.253194   0.820618   3.964 9.08e-05 ***
## internetyes       0.795038   0.480166   1.656  0.09875 .  
## romanticyes      -0.188022   0.368039  -0.511  0.60979    
## famrel            0.464018   0.184239   2.519  0.01227 *  
## freetime          0.389599   0.174863   2.228  0.02657 *  
## goout            -0.440626   0.179184  -2.459  0.01446 *  
## Dalc             -0.005198   0.249484  -0.021  0.98339    
## Walc             -0.031654   0.191219  -0.166  0.86863    
## health           -0.088728   0.120704  -0.735  0.46282    
## absences         -0.260683   0.158545  -1.644  0.10111    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.002 on 321 degrees of freedom
## Multiple R-squared:  0.9434, Adjusted R-squared:  0.937 
## F-statistic: 148.6 on 36 and 321 DF,  p-value: < 2.2e-16

The adjusted \(R^2\) jumps to 0.937, but this is an artifact: for a no-intercept model, R computes \(R^2\) about zero rather than about the mean response, so it is not comparable to the \(R^2\) of the models with an intercept.
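A sketch of a comparable figure for model2, computing \(R^2\) about the mean response by hand:

# R^2 about the mean response (comparable to the intercept model's R^2)
ss_res <- sum(residuals(model2)^2)
ss_tot <- sum((student_d$G3 - mean(student_d$G3))^2)
1 - ss_res / ss_tot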

Diagnostics

stud_resid = rstudent(model2)
dat_resid <- data.frame(RESID = stud_resid, ABS = abs(stud_resid), PRED = model2$fitted.values)
dat_resid %>% arrange(-ABS) %>% filter(ABS > 2)
##        RESID      ABS      PRED
## 1   2.907693 2.907693 10.921897
## 2   2.535449 2.535449  6.816460
## 3   2.531060 2.531060  9.895896
## 4   2.475705 2.475705 11.073511
## 5   2.449680 2.449680 11.011177
## 6   2.424527 2.424527 12.089761
## 7  -2.268277 2.268277 12.550590
## 8   2.233351 2.233351 11.676215
## 9  -2.215128 2.215128 11.334789
## 10  2.168795 2.168795 11.738572
## 11  2.100816 2.100816  7.033648
## 12 -2.080427 2.080427 12.806819
## 13  2.079162 2.079162 13.036056
## 14 -2.050184 2.050184 11.897140
## 15 -2.044874 2.044874 11.702064
nortest::ad.test(dat_resid$RESID)
## 
##  Anderson-Darling normality test
## 
## data:  dat_resid$RESID
## A = 0.33791, p-value = 0.503
p1 <- ggplot(data = dat_resid) + geom_histogram(aes(RESID), bins = 25) + labs(x = "Studentized Residuals")
p2 <- ggplot(data = dat_resid) + geom_point(aes(PRED, RESID)) + geom_hline(aes(yintercept = 0)) + labs(x = "Predicted Values", y = "Studentized Residuals")
cowplot::plot_grid(p1, p2)

Model 3: Stepwise Model

set.seed(12345)
modelback <- step(model1, direction = "backward", trace = F)
modelback
## 
## Call:
## lm(formula = G3 ~ sexM + age + addressU + Mjobhealth + Mjobservices + 
##     Fjobteacher + studytime + failures + schoolsupyes + famsupyes + 
##     goout + health + absences, data = student_d)
## 
## Coefficients:
##  (Intercept)          sexM           age      addressU    Mjobhealth  
##      13.1670        0.6631       -0.2024        0.6295        1.7986  
## Mjobservices   Fjobteacher     studytime      failures  schoolsupyes  
##       1.4065        2.1208        0.5661       -0.9418       -2.3196  
##    famsupyes         goout        health      absences  
##      -0.6310       -0.4774       -0.2617       -0.3240
modelboth <- step(model1, direction = "both", trace = F)
modelboth
## 
## Call:
## lm(formula = G3 ~ sexM + age + addressU + Mjobhealth + Mjobservices + 
##     Fjobteacher + studytime + failures + schoolsupyes + famsupyes + 
##     goout + health + absences, data = student_d)
## 
## Coefficients:
##  (Intercept)          sexM           age      addressU    Mjobhealth  
##      13.1670        0.6631       -0.2024        0.6295        1.7986  
## Mjobservices   Fjobteacher     studytime      failures  schoolsupyes  
##       1.4065        2.1208        0.5661       -0.9418       -2.3196  
##    famsupyes         goout        health      absences  
##      -0.6310       -0.4774       -0.2617       -0.3240
modelforward <- step(model1, direction = "forward", trace = F)
modelforward 
## 
## Call:
## lm(formula = G3 ~ sexM + age + addressU + famsizeLE3 + PstatusT + 
##     Medu + Fedu + Mjobhealth + Mjobother + Mjobservices + Mjobteacher + 
##     Fjobhealth + Fjobother + Fjobservices + Fjobteacher + reasonhome + 
##     reasonother + reasonreputation + traveltime + studytime + 
##     failures + schoolsupyes + famsupyes + paidyes + activitiesyes + 
##     nurseryyes + higheryes + internetyes + romanticyes + famrel + 
##     freetime + goout + Dalc + Walc + health + absences, data = student_d)
## 
## Coefficients:
##      (Intercept)              sexM               age          addressU  
##         12.61184           0.57764          -0.18458           0.48945  
##       famsizeLE3          PstatusT              Medu              Fedu  
##          0.24556          -0.24239           0.17054           0.13734  
##       Mjobhealth         Mjobother      Mjobservices       Mjobteacher  
##          1.14310          -0.58096           0.84323          -0.71277  
##       Fjobhealth         Fjobother      Fjobservices       Fjobteacher  
##         -0.63368          -0.63043          -0.79711           1.18464  
##       reasonhome       reasonother  reasonreputation        traveltime  
##          0.26472          -0.09380           0.13815           0.02845  
##        studytime          failures      schoolsupyes         famsupyes  
##          0.51548          -0.82301          -2.42939          -0.63394  
##          paidyes     activitiesyes        nurseryyes         higheryes  
##         -0.44278           0.12753          -0.27189           0.39340  
##      internetyes       romanticyes            famrel          freetime  
##          0.61575          -0.09634           0.12137           0.10646  
##            goout              Dalc              Walc            health  
##         -0.45766           0.05727          -0.15349          -0.23289  
##         absences  
##         -0.36947

Backward and both-direction selection choose the same variables.

Forward selection returns all the variables: since it starts from the supplied full model and can only add terms, it leaves model1 unchanged.

Note that the stepwise procedure does not give exactly the same answer when started from model1 (with intercept) as when started from model2 (without).
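By default step() selects on AIC (penalty k = 2); a sketch of the same backward search under the BIC penalty, which tends to choose smaller models:

# Backward stepwise with the BIC penalty k = log(n) instead of the AIC default k = 2
n <- nrow(student_d)
modelback_bic <- step(model1, direction = "backward", trace = FALSE, k = log(n))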

summary(modelback)
## 
## Call:
## lm(formula = G3 ~ sexM + age + addressU + Mjobhealth + Mjobservices + 
##     Fjobteacher + studytime + failures + schoolsupyes + famsupyes + 
##     goout + health + absences, data = student_d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.9070 -1.9329 -0.1166  1.8534  7.3467 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   13.1670     0.8580  15.347  < 2e-16 ***
## sexM           0.6631     0.3151   2.104 0.036096 *  
## age           -0.2024     0.1294  -1.564 0.118670    
## addressU       0.6295     0.3612   1.743 0.082292 .  
## Mjobhealth     1.7986     0.5323   3.379 0.000811 ***
## Mjobservices   1.4065     0.3444   4.084 5.51e-05 ***
## Fjobteacher    2.1208     0.5664   3.745 0.000212 ***
## studytime      0.5661     0.1880   3.012 0.002788 ** 
## failures      -0.9418     0.2358  -3.994 7.94e-05 ***
## schoolsupyes  -2.3196     0.4464  -5.197 3.49e-07 ***
## famsupyes     -0.6310     0.3116  -2.025 0.043653 *  
## goout         -0.4774     0.1366  -3.494 0.000538 ***
## health        -0.2617     0.1060  -2.468 0.014074 *  
## absences      -0.3240     0.1328  -2.440 0.015181 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.743 on 343 degrees of freedom
## Multiple R-squared:  0.3042, Adjusted R-squared:  0.2778 
## F-statistic: 11.53 on 13 and 343 DF,  p-value: < 2.2e-16

So far this model gives us the best adjusted \(R^2\) (0.278) and the most significant variables.

Diagnostics of modelback

stud_resid = rstudent(modelback)
dat_resid <- data.frame(RESID = stud_resid, ABS = abs(stud_resid), PRED = modelback$fitted.values)
dat_resid %>% arrange(-ABS) %>% filter(ABS > 2)
##        RESID      ABS      PRED
## 1   2.728324 2.728324 11.653339
## 2   2.714544 2.714544 11.763080
## 3   2.594468 2.594468 10.103717
## 4   2.493234 2.493234 12.314119
## 5   2.350334 2.350334 11.859641
## 6   2.332082 2.332082 11.696637
## 7   2.254386 2.254386 11.999121
## 8   2.196549 2.196549 12.233189
## 9  -2.182972 2.182972 11.906954
## 10 -2.182586 2.182586 10.906406
## 11  2.128817 2.128817 12.355736
## 12 -2.100461 2.100461 11.622068
## 13 -2.052568 2.052568 11.514590
## 14  2.048496 2.048496  8.495680
## 15  2.035560 2.035560  7.549169
## 16  2.005830 2.005830 10.601775
nortest::ad.test(dat_resid$RESID)
## 
##  Anderson-Darling normality test
## 
## data:  dat_resid$RESID
## A = 0.55571, p-value = 0.1506
p1 <- ggplot(data = dat_resid) + geom_histogram(aes(RESID), bins = 25) + labs(x = "Studentized Residuals")
p2 <- ggplot(data = dat_resid) + geom_point(aes(PRED, RESID)) + geom_hline(aes(yintercept = 0)) + labs(x = "Predicted Values", y = "Studentized Residuals")
cowplot::plot_grid(p1, p2)

Model 4: Interaction Model (model.int)

Of all possible interaction terms, several were significant. Among those, the most interpretable are studytime:schoolsupyes, studytime:famsupyes, studytime:goout, and schoolsupyes:goout. Interestingly, the effect of studytime appears to be reduced for students who go out a lot; perhaps fatigue reduces the effectiveness of studying.
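The screening step itself is not shown above; a sketch of one way to do it (the variable set here is illustrative) is to fit all two-way interactions among a handful of the stepwise variables and inspect which are significant:

# (a + b + ...)^2 expands to all main effects plus all pairwise interactions
screen <- lm(G3 ~ (studytime + schoolsupyes + famsupyes + goout + failures)^2,
             data = student_d)
summary(screen)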

model.int <- lm(G3 ~ sexM + Mjobhealth + Mjobservices + Fjobteacher + studytime + 
                  failures + schoolsupyes + famsupyes + paidyes + internetyes + goout + health +
                  absences + studytime:schoolsupyes + studytime:famsupyes + studytime:goout + schoolsupyes:goout, 
   data = student_d)
summary(model.int)
## 
## Call:
## lm(formula = G3 ~ sexM + Mjobhealth + Mjobservices + Fjobteacher + 
##     studytime + failures + schoolsupyes + famsupyes + paidyes + 
##     internetyes + goout + health + absences + studytime:schoolsupyes + 
##     studytime:famsupyes + studytime:goout + schoolsupyes:goout, 
##     data = student_d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3845 -1.7486 -0.0282  1.8463  7.2936 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             10.9712     1.3878   7.906 3.78e-14 ***
## sexM                     0.5683     0.3133   1.814 0.070602 .  
## Mjobhealth               1.8297     0.5264   3.476 0.000576 ***
## Mjobservices             1.4148     0.3406   4.154 4.14e-05 ***
## Fjobteacher              2.0923     0.5615   3.727 0.000227 ***
## studytime                1.4287     0.5851   2.442 0.015131 *  
## failures                -1.1644     0.2323  -5.012 8.70e-07 ***
## schoolsupyes             3.1224     1.6756   1.863 0.063259 .  
## famsupyes               -1.5353     0.8059  -1.905 0.057613 .  
## paidyes                 -0.4169     0.3101  -1.344 0.179735    
## internetyes              0.6524     0.4080   1.599 0.110768    
## goout                    0.2661     0.3562   0.747 0.455607    
## health                  -0.2340     0.1052  -2.225 0.026749 *  
## absences                -0.3547     0.1313  -2.702 0.007231 ** 
## studytime:schoolsupyes  -1.4887     0.5333  -2.791 0.005545 ** 
## studytime:famsupyes      0.5053     0.3740   1.351 0.177549    
## studytime:goout         -0.3400     0.1653  -2.057 0.040417 *  
## schoolsupyes:goout      -0.7082     0.3604  -1.965 0.050213 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.708 on 339 degrees of freedom
## Multiple R-squared:  0.3296, Adjusted R-squared:  0.296 
## F-statistic: 9.804 on 17 and 339 DF,  p-value: < 2.2e-16

Diagnostics of model.int

stud_resid = rstudent(model.int)
dat_resid <- data.frame(RESID = stud_resid, ABS = abs(stud_resid), PRED = model.int$fitted.values)
dat_resid %>% arrange(-ABS) %>% filter(ABS > 2)
##        RESID      ABS     PRED
## 1   2.767414 2.767414 11.70644
## 2   2.617816 2.617816 12.03399
## 3   2.550088 2.550088 12.30526
## 4  -2.450654 2.450654 11.38453
## 5   2.406855 2.406855 11.82766
## 6   2.371032 2.371032 10.73981
## 7   2.347761 2.347761 11.75319
## 8   2.327249 2.327249 11.88466
## 9   2.318470 2.318470  7.90480
## 10 -2.288029 2.288029 12.10269
## 11 -2.231194 2.231194 10.94818
## 12 -2.121865 2.121865 11.65760
## 13  2.037154 2.037154 12.63631
nortest::ad.test(dat_resid$RESID)
## 
##  Anderson-Darling normality test
## 
## data:  dat_resid$RESID
## A = 0.46134, p-value = 0.2578

The normality assumption is justified for model.int (p = 0.258).

p1 <- ggplot(data = dat_resid) + geom_histogram(aes(RESID), bins = 25) + labs(x = "Studentized Residuals")
p2 <- ggplot(data = dat_resid) + geom_point(aes(PRED, RESID)) + geom_hline(aes(yintercept = 0)) + labs(x = "Predicted Values", y = "Studentized Residuals")
cowplot::plot_grid(p1, p2)

Model 5: Reduced Stepwise Model

lm(formula = G3 ~ sexM + Mjobhealth + Mjobservices + 
    Fjobteacher + studytime + failures + schoolsupyes + famsupyes + 
    goout + health + absences, data = student_d) %>% 
  summary()
## 
## Call:
## lm(formula = G3 ~ sexM + Mjobhealth + Mjobservices + Fjobteacher + 
##     studytime + failures + schoolsupyes + famsupyes + goout + 
##     health + absences, data = student_d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8085 -1.9348 -0.1866  1.9893  7.2760 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   13.3782     0.7956  16.816  < 2e-16 ***
## sexM           0.6779     0.3164   2.143 0.032841 *  
## Mjobhealth     1.9787     0.5307   3.729 0.000225 ***
## Mjobservices   1.5156     0.3436   4.411 1.38e-05 ***
## Fjobteacher    2.1484     0.5694   3.773 0.000190 ***
## studytime      0.5406     0.1888   2.863 0.004453 ** 
## failures      -1.0778     0.2290  -4.706 3.66e-06 ***
## schoolsupyes  -2.1153     0.4326  -4.890 1.55e-06 ***
## famsupyes     -0.5913     0.3120  -1.895 0.058899 .  
## goout         -0.4831     0.1365  -3.539 0.000457 ***
## health        -0.2596     0.1065  -2.437 0.015334 *  
## absences      -0.3581     0.1314  -2.725 0.006755 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.76 on 345 degrees of freedom
## Multiple R-squared:  0.2914, Adjusted R-squared:  0.2688 
## F-statistic:  12.9 on 11 and 345 DF,  p-value: < 2.2e-16
modelsmall <- lm(G3 ~ Mjobhealth + Mjobservices + Fjobteacher + failures + schoolsupyes + goout, data = student_d)
summary(modelsmall)
## 
## Call:
## lm(formula = G3 ~ Mjobhealth + Mjobservices + Fjobteacher + failures + 
##     schoolsupyes + goout, data = student_d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.5735 -1.8793 -0.0725  1.9254  7.9254 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   13.0704     0.4698  27.819  < 2e-16 ***
## Mjobhealth     1.7801     0.5422   3.283 0.001130 ** 
## Mjobservices   1.3399     0.3522   3.804 0.000168 ***
## Fjobteacher    2.0188     0.5860   3.445 0.000640 ***
## failures      -1.2676     0.2301  -5.508 7.05e-08 ***
## schoolsupyes  -2.2548     0.4384  -5.143 4.51e-07 ***
## goout         -0.4989     0.1405  -3.552 0.000435 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.855 on 350 degrees of freedom
## Multiple R-squared:  0.231,  Adjusted R-squared:  0.2178 
## F-statistic: 17.52 on 6 and 350 DF,  p-value: < 2.2e-16

Diagnostics of modelsmall

stud_resid_small = rstudent(modelsmall)
dat_resid_small <- data.frame(RESID = stud_resid_small, ABS = abs(stud_resid_small), PRED = modelsmall$fitted.values)
dat_resid_small %>% arrange(-ABS) %>% filter(ABS > 2)
##        RESID      ABS     PRED
## 1   2.814574 2.814574 11.07460
## 2   2.630448 2.630448 11.57355
## 3   2.453589 2.453589 12.07249
## 4  -2.334139 2.334139 12.57143
## 5  -2.323367 2.323367 11.57355
## 6   2.281179 2.281179 10.57566
## 7  -2.277173 2.277173 12.41453
## 8   2.270601 2.270601 11.57355
## 9   2.270601 2.270601 11.57355
## 10  2.205028 2.205028 13.85261
## 11  2.166041 2.166041 11.91558
## 12 -2.146474 2.146474 12.07249
## 13 -2.146474 2.146474 12.07249
nortest::ad.test(dat_resid_small$RESID)
## 
##  Anderson-Darling normality test
## 
## data:  dat_resid_small$RESID
## A = 0.3526, p-value = 0.4646

The small model satisfies the conditions for the Gaussian errors assumption (p = 0.465).

p1.small <- ggplot(data = dat_resid_small) + geom_histogram(aes(RESID), bins = 25) + labs(x = "Studentized Residuals")
p2.small <- ggplot(data = dat_resid_small) + geom_point(aes(PRED, RESID)) + geom_hline(aes(yintercept = 0)) + labs(x = "Predicted Values", y = "Studentized Residuals")
cowplot::plot_grid(p1.small, p2.small)

The small model also appears to satisfy the diagnostic checks for homogeneity and linearity. Existence is trivial, but we may want to discuss limitations of the independence assumption (e.g., students may form study groups, or grades may be set on a curve).

Model 6: Semi-Reduced Stepwise Model

model6 <- lm(formula = G3 ~ sexM + Mjobhealth + Mjobservices + 
    Fjobteacher + studytime + failures + schoolsupyes + famsupyes + 
    goout + health + absences, data = student_d)
summary(model6)
## 
## Call:
## lm(formula = G3 ~ sexM + Mjobhealth + Mjobservices + Fjobteacher + 
##     studytime + failures + schoolsupyes + famsupyes + goout + 
##     health + absences, data = student_d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.8085 -1.9348 -0.1866  1.9893  7.2760 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   13.3782     0.7956  16.816  < 2e-16 ***
## sexM           0.6779     0.3164   2.143 0.032841 *  
## Mjobhealth     1.9787     0.5307   3.729 0.000225 ***
## Mjobservices   1.5156     0.3436   4.411 1.38e-05 ***
## Fjobteacher    2.1484     0.5694   3.773 0.000190 ***
## studytime      0.5406     0.1888   2.863 0.004453 ** 
## failures      -1.0778     0.2290  -4.706 3.66e-06 ***
## schoolsupyes  -2.1153     0.4326  -4.890 1.55e-06 ***
## famsupyes     -0.5913     0.3120  -1.895 0.058899 .  
## goout         -0.4831     0.1365  -3.539 0.000457 ***
## health        -0.2596     0.1065  -2.437 0.015334 *  
## absences      -0.3581     0.1314  -2.725 0.006755 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.76 on 345 degrees of freedom
## Multiple R-squared:  0.2914, Adjusted R-squared:  0.2688 
## F-statistic:  12.9 on 11 and 345 DF,  p-value: < 2.2e-16

Diagnostics of Model 6


stud_resid = rstudent(model6)
dat_resid <- data.frame(RESID = stud_resid, ABS = abs(stud_resid), PRED = model6$fitted.values)
dat_resid %>% arrange(-ABS) %>% filter(ABS > 2)
##        RESID      ABS      PRED
## 1   2.678210 2.678210 11.724004
## 2   2.669086 2.669086 11.777365
## 3   2.428470 2.428470 11.392824
## 4   2.421910 2.421910 12.449989
## 5   2.406835 2.406835 11.671522
## 6   2.300255 2.300255 10.796698
## 7   2.196107 2.196107 12.107675
## 8  -2.131205 2.131205 11.808485
## 9   2.115176 2.115176 12.240829
## 10  2.091643 2.091643 12.401337
## 11  2.061395 2.061395  8.423945
## 12 -2.039452 2.039452 10.565676
## 13  2.037242 2.037242 13.541178
## 14  2.033488 2.033488 12.575063
## 15  2.020778 2.020778 10.525226
nortest::ad.test(dat_resid$RESID)
## 
##  Anderson-Darling normality test
## 
## data:  dat_resid$RESID
## A = 0.74385, p-value = 0.05225
p1 <- ggplot(data = dat_resid) + geom_histogram(aes(RESID), bins = 25) + labs(x = "Studentized Residuals")
p2 <- ggplot(data = dat_resid) + geom_point(aes(PRED, RESID)) + geom_hline(aes(yintercept = 0)) + labs(x = "Predicted Values", y = "Studentized Residuals")
cowplot::plot_grid(p1, p2)

Model 7: Reduced Interaction Model

smallint <- lm(G3 ~ Mjobhealth + Mjobservices + Fjobteacher + studytime + failures + schoolsupyes + absences + studytime*schoolsupyes, data = student_d)
summary(smallint)
## 
## Call:
## lm(formula = G3 ~ Mjobhealth + Mjobservices + Fjobteacher + studytime + 
##     failures + schoolsupyes + absences + studytime * schoolsupyes, 
##     data = student_d)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3169 -1.8571 -0.0307  1.9693  8.5644 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             11.1451     0.5269  21.153  < 2e-16 ***
## Mjobhealth               1.5611     0.5367   2.909 0.003861 ** 
## Mjobservices             1.3264     0.3497   3.793 0.000176 ***
## Fjobteacher              2.0558     0.5830   3.526 0.000478 ***
## studytime                0.5550     0.1959   2.833 0.004884 ** 
## failures                -1.1399     0.2328  -4.896  1.5e-06 ***
## schoolsupyes             0.1176     1.2027   0.098 0.922180    
## absences                -0.4215     0.1331  -3.168 0.001672 ** 
## studytime:schoolsupyes  -1.1585     0.5398  -2.146 0.032571 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.83 on 348 degrees of freedom
## Multiple R-squared:  0.2484, Adjusted R-squared:  0.2312 
## F-statistic: 14.38 on 8 and 348 DF,  p-value: < 2.2e-16

Diagnostics of Reduced Interaction Model

stud_resid = rstudent(smallint)
dat_resid <- data.frame(RESID = stud_resid, ABS = abs(stud_resid), PRED = smallint$fitted.values)
dat_resid %>% arrange(-ABS) %>% filter(ABS > 2)
##        RESID      ABS     PRED
## 1   3.084691 3.084691 10.43556
## 2   2.558145 2.558145 10.85706
## 3   2.515938 2.515938 11.96703
## 4   2.363309 2.363309 12.38852
## 5   2.362362 2.362362 11.47392
## 6   2.273524 2.273524 11.83483
## 7  -2.262882 2.262882 12.31691
## 8   2.213032 2.213032 12.91281
## 9   2.143890 2.143890 10.99055
## 10 -2.137058 2.137058 10.99055
## 11 -2.105066 2.105066 10.72103
## 12  2.088324 2.088324 13.15990
## 13 -2.077790 2.077790 11.83354
## 14 -2.077790 2.077790 11.83354
## 15 -2.077790 2.077790 11.83354
## 16 -2.077790 2.077790 11.83354
## 17  2.001355 2.001355 12.38852
## 18  2.000896 2.000896 14.50464
nortest::ad.test(dat_resid$RESID)
## 
##  Anderson-Darling normality test
## 
## data:  dat_resid$RESID
## A = 0.53226, p-value = 0.1725
p1 <- ggplot(data = dat_resid) + geom_histogram(aes(RESID), bins = 25) + labs(x = "Studentized Residuals")
p2 <- ggplot(data = dat_resid) + geom_point(aes(PRED, RESID)) + geom_hline(aes(yintercept = 0)) + labs(x = "Predicted Values", y = "Studentized Residuals")
cowplot::plot_grid(p1, p2)

Model 6 All Subjects: Sensitivity Analysis for Dropping All Subjects with G3 = 0

As a sensitivity check on our decision to drop students with a final grade of 0, we refit the eight-variable best-subset model on the full dataset (student_d_all), which retains those students.

model6_all <- lm(G3 ~ Mjobhealth + Mjobservices + Fjobteacher + failures + schoolsupyes + goout + health + absences, data = student_d_all)
summary(model6_all)
## 
## Call:
## lm(formula = G3 ~ Mjobhealth + Mjobservices + Fjobteacher + failures + 
##     schoolsupyes + goout + health + absences, data = student_d_all)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.1619  -1.7970   0.4408   2.7480   8.4881 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   12.1885     0.8886  13.716  < 2e-16 ***
## Mjobhealth     2.2767     0.7659   2.972  0.00314 ** 
## Mjobservices   1.5218     0.4909   3.100  0.00208 ** 
## Fjobteacher    1.4156     0.8087   1.751  0.08083 .  
## failures      -2.1902     0.2891  -7.576 2.66e-13 ***
## schoolsupyes  -1.2548     0.6285  -1.996  0.04659 *  
## goout         -0.4418     0.1910  -2.313  0.02123 *  
## health        -0.1973     0.1523  -1.295  0.19609    
## absences       0.2879     0.1885   1.527  0.12755    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.164 on 386 degrees of freedom
## Multiple R-squared:  0.1907, Adjusted R-squared:  0.1739 
## F-statistic: 11.37 on 8 and 386 DF,  p-value: 1.699e-14

Extreme Residual Analysis

Influence – Cook’s distance: values near 1 indicate highly influential observations.

Outliers – leverage (hat) values: values greater than \(2p/n\) flag high-leverage observations.

p <- modelback$coefficients %>% length()  # number of parameters p
n <- nrow(student_d)                      # number of observations n
levcutoff <- 2*p/n                        # leverage cutoff 2p/n
lev <- hatvalues(modelback)               # leverage (hat) values
cook <- cooks.distance(modelback)         # Cook's distances
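A sketch of how these cutoffs might be applied (the object names below are our own):

# Flag observations exceeding the leverage cutoff or with a large Cook's distance
flagged <- which(lev > levcutoff | cook > 1)
length(flagged)       # how many observations are flagged
student_d[flagged, ]  # inspect the flagged rows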

A natural next step would be to remove the flagged outliers and rerun the regression to check the sensitivity of the fit.

We decide to further remove a few outliers informally to increase the accuracy of our analysis. For example, we noticed that there were five students in the 20-22 age range. Since ages 15-18 had 70+ students per age, and age 19 had 19 students, we decided to remove the 20-22 year old students. For students of this age to be in secondary school, it is likely that they had certain extenuating circumstances that may not be represented in the dataset. Moreover, we simply have too few observations in this age range to be able to accurately analyze the effects of age on outcome in the 20+ region. Therefore, we went ahead and removed these students.

table(student$age)  # recall: age was rescaled, so 0 corresponds to 15 years old
## 
##  0  1  2  3  4  5  6  7 
## 76 97 90 70 19  3  1  1
student <- student %>%
  filter(age < 20 - 15)  # age is rescaled by -15, so this keeps actual ages 15-19

Model Selection and Interpretation

Aside from the adjusted \(R^2\), we can compare models using AIC and SBC/BIC (smaller is better in both cases). More or different models can be added to this code, but here is the basis:

aic <- c(AIC(model1), AIC(model2), AIC(modelback), AIC(model.int), AIC(modelsmall), AIC(smallint))
bic <- c(BIC(model1), BIC(model2), BIC(modelback), BIC(model.int), BIC(modelsmall), BIC(smallint))
adjR <- c( summary(model1)$adj.r.sq,  summary(model2)$adj.r.sq,  summary(modelback)$adj.r.sq, summary(model.int)$adj.r.sq, summary(modelsmall)$adj.r.sq, summary(smallint)$adj.r.sq)
num.pred <- c(model1$coefficients %>% length, model2$coefficients %>% length, modelback$coefficients %>% length, model.int$coefficients %>% length, modelsmall$coefficients %>% length, smallint$coefficients %>% length)

selection <- data.frame(model = c("Model 1: Full Model", "Model 2: No Intercept", "Model 3: Stepwise Model", "Model 4: Stepwise Model + Interaction Terms", "Model 5: Reduced Stepwise Model", "Model 6: Reduced Interaction Model"), Parameters = num.pred, AICs = aic, BICs = bic, Adjusted.R.Squared = adjR)
selection %>% knitr::kable(align = c("c", "c"))
Model                                          Parameters     AIC        BIC     Adjusted R²
Model 1: Full Model                                    37   1782.011   1929.365    0.2542165
Model 2: No Intercept                                  36   1834.145   1977.621    0.9370466
Model 3: Stepwise Model                                14   1749.316   1807.482    0.2778046
Model 4: Stepwise Model + Interaction Terms            18   1744.031   1817.708    0.2959758
Model 5: Reduced Stepwise Model                         7   1771.009   1802.031    0.2178292
Model 6: Reduced Interaction Model                      9   1766.823   1805.600    0.2311681

Conclusion

Setting aside Model 2’s no-intercept \(R^2\) artifact, Model 4 (the stepwise model with interaction terms) gives the best AIC and adjusted \(R^2\), while Model 5 (the reduced stepwise model) gives the best BIC with far fewer parameters.

References

[1] P. Cortez and A. Silva. Using Data Mining to Predict Secondary School Student Performance. In A. Brito and J. Teixeira Eds., Proceedings of 5th Future Business Technology Conference (FUBUTEC 2008) pp. 5-12, Porto, Portugal, April, 2008, EUROSIS, ISBN 978-9077381-39-7.